<<<<<<< HEAD ======= >>>>>>> release/0.1.0 clusterExperiment Vignette

Contents

1 Introduction

The goal of this package is to encourage the user to try many different clustering algorithms in one package structure. We give tools for running many different clusterings and choices of parameters. We also provide visualization to compare many different clusterings and algorithm tools to find common shared clustering patterns. We implement common post-processing steps unrelated to the specific clustering algorithm (e.g. subsampling the data for stability, finding biomarkers related the clustering).

We also provide a class clusterExperiment that inherits from SummarizedExperiment class to store the many clusterings and related information.

The results from our methods will make an object ClusterExperiment that expands the SummarizedExperiment object, meaning the commands for accessing the data of a SummarizedExperiment will carry over to the ClusterExperiment.

All of our methods also have a barebones version that allows input of matrices and greater control. This comes at the expense of the user having to manage and keep track of the clusters, input data, transformation of the data, etc. We do not discuss these barebone versions in this tutorial. Instead, we focus in this tutorial on using SummarizedExperiment object as the input and working with the resulting ClusterExperiment object. See the help pages of the functions for more about the versions of the functions that allow for matrix input.

1.1 The clustering workflow

The package encodes many common practices that are shared across clustering algorithms, like subsampling the data, computing silhouete width, sequential clustering procedures, and so forth. We describe here the basic expected usage of the package. Additional functionality is available in the clusterSingle function, and the user should see the documentation in the help pages for that function.

The basic intended clustering workflow is

The basic premise of our workflow is to find small, robust clusters of samples, and then unify them into larger clusters as relevant. We find that many algorithmic methods for choosing the appropriate number of clusters for methods err on the side of a few number of clusters. However, we find in practice that we tend to prefer to err on finding many clusters and then merging them based on examining the data.

Users can also run or add their own clusters to this workflow at different stages. However, there is special functionality to ease in quickly visualizing and managing the results of this workflow.

1.3 Visualization Tools

We provide a visualization to compare many clusterings of the same data in the function plotClusters. We also provide some functions that are wrappers for common visualization tasks in clustering gene expression data. We provide an heatmap function, plotHeatmap that is an interface to the aheatmap function in NMF package.

2 Quickstart

We will go quickly through the standard steps of clustering using the clusterExperiment package, before turning to more details. The standard workflow we envision is the following:

2.1 Data example

We will make use of a single cell RNA sequencing experiment made available in the scRNAseq package at github/drisso. For the current version of this vignette, we used version 0.0.8 of the scRNAseq package, available at (https://github.com/drisso/scRNAseq/releases/download/v0.0.8/scRNAseq_0.0.8.tar.gz). Once downloaded, the package can be installed within R with the following command.

install.packages("scRNAseq_0.0.8.tar.gz", repos=NULL)
library(scRNAseq)
data("fluidigm")

We will use the fluidigm dataset (see help("fluidigm")). This dataset is stored as a SummarizedExperiment object. We can access the data with assay and metadata on the samples with colData.

assay(fluidigm)[1:5,1:10]
##          SRR1275356 SRR1274090 SRR1275251 SRR1275287 SRR1275364 SRR1275269
## A1BG              0          0          0          0          0          0
## A1BG-AS1          0          0          0          0          0          0
## A1CF              0          0          0          0          0          0
## A2M               0          0          0         31          0         46
## A2M-AS1           0          0          0          0          0          0
##          SRR1275263 SRR1275242 SRR1275338 SRR1274117
## A1BG              0          0          0          0
## A1BG-AS1          0          0          0          0
## A1CF              0          0          0          0
## A2M               0          0          0         29
## A2M-AS1           0          0          0        133
colData(fluidigm)[,1:5]
## DataFrame with 130 rows and 5 columns
##               NREADS  NALIGNED    RALIGN TOTAL_DUP    PRIMER
##            <numeric> <numeric> <numeric> <numeric> <numeric>
## SRR1275356  10554900   7555880   71.5862   58.4931 0.0217638
## SRR1274090    196162    182494   93.0323   14.5122 0.0366826
## SRR1275251   8524470   5858130   68.7213   65.0428 0.0351827
## SRR1275287   7229920   5891540   81.4884   49.7609 0.0208685
## SRR1275364   5403640   4482910   82.9609   66.5788 0.0298284
## ...              ...       ...       ...       ...       ...
## SRR1275259   5949930   4181040   70.2705   52.5975 0.0205253
## SRR1275253  10319900   7458710   72.2747   54.9637 0.0205342
## SRR1275285   5300270   4276650   80.6873   41.6394 0.0227383
## SRR1275366   7701320   6373600   82.7600   68.9431 0.0266275
## SRR1275261  13425000   9554960   71.1727   62.0001 0.0200522
NCOL(fluidigm) #number of samples
## [1] 130

2.2 Step 0: Filtering and normalization

While there are 130 samples, there are only 65 cells, because each cell is sequenced twice as different sequencing depth. We will limit to those samples corresponding to the sequencing of the cells with high sequencing depth.

se <- fluidigm[,colData(fluidigm)[,"Coverage_Type"]=="High"]

We also filter out lowly expressed genes: we retain only those genes with at least 10 reads in at least 10 cells.

wh_zero <- which(rowSums(assay(se))==0)
pass_filter <- apply(assay(se), 1, function(x) length(x[x >= 10]) >= 10)
se <- se[pass_filter,]
dim(se)
## [1] 7066   65

This removed 19196 genes out of 26262. We now have 7066 genes (or features) remaining. Notice that it is important to remove genes with all zero in all samples (we had 9681 genes which were zero in all samples here). Otherwise, PCA dimensionality reductions and other implementations may have a problem.

<<<<<<< HEAD

Normalization is an important step in any RNA-seq data analysis and many different normalization methods have been proposed in the literature. Comparing normalization methods or finding the best performing normalization in this dataset is outside of the scope of this vignette. Instead, we will use a simple quantile normalization that will make our clustering reflect the biology rather than the difference in sequencing depth among the different samples.

library(limma)
## Error in library(limma): Package 'limma' version 3.27.14 cannot be unloaded
fq <- round(normalizeQuantiles(assay(se)))
## Error in eval(expr, envir, enclos): could not find function "normalizeQuantiles"
assays(se) <- list(normalized_counts=fq)
## Error in eval(expr, envir, enclos): object 'fq' not found
=======

Normalization is an important step in any RNA-seq data analysis and many different normalization methods have been proposed in the literature. Comparing normalization methods or finding the best performing normalization in this dataset is outside of the scope of this vignette. Instead, we will use a simple quantile normalization that will at least make our clustering reflect the biology rather than the difference in sequencing depth among the different samples.

library(limma)
fq <- round(normalizeQuantiles(assay(se)))
assays(se) <- list(normalized_counts=fq)
>>>>>>> release/0.1.0

2.3 Step 1: Clustering with clusterMany

clusterMany lets the user quickly pick between many clustering options and run all of the clusterings in one single command. In the quick start we pick a simple set of clusterings based on varying the dimensionality reduction options. The way to pick designate which options to vary is to give multiple values to an argument. Here is our call to clusterMany:

options(getClass.msg=FALSE) #get rid of annoying messages about cache until fixed internally in R
library(clusterExperiment)
ce<-clusterMany(se, clusterFunction="pam",ks=5:10,findBestK=FALSE,removeSil=c(FALSE),
      isCount=TRUE,dimReduce=c("PCA","mostVar"),nVarDims=c(100,500,1000),
      nPCADims=c(5,15,50),run=TRUE)

In this call to clusterMany we made the follow choices about what to vary:

Other possible options we kept fixed to reduce the number of clusterings by giving only a single value to the argument:

We will also set isCount=TRUE to indicate that our input data are counts. This means that operations for clustering and visualizations will internally transform the data as \(log_2(x+1)\) (We could have alternatively explicitly set a transformation by giving a function to the transFun argument, for example if we wanted \(\sqrt(x)\) or \(log(x+\epsilon)\) or just function(x){x} for the identify).

ce<-clusterMany(se, clusterFunction="pam",ks=5:10,findBestK=FALSE,removeSil=c(FALSE),
      isCount=TRUE,dimReduce=c("PCA","mostVar"),nVarDims=c(100,500,1000),
      nPCADims=c(5,15,50),run=TRUE)

We would like to visualize this clustering using the plotClusters command. We will also add some information about the samples saved in the colData slot (inherited from our original fluidigm object). We will choose the columns “Biological_Condition” and “Cluster2” from colData, which correspond to the original biological condition of the experiment, and the clusters reported in the original paper, respectively. We can give these names to the sampleData object, and it will retreive them from the colData. It is also useful to change the amount of space to allow for the labels of the clusterings, so we will reset the mar option in par (we also change axisLine argument for this reason).

=======
  • we used ‘pam’ for all clustering (clusterFunction)
  • we let the algorithm pick the best \(k\) ( from 2-10) via silhouette distance (findBestK=TRUE and ks=2:10)
  • we chose to unassign samples from clusters if they have negative silhouette scores (removeSil=TRUE). Such unassigned samples will be given the value -1.
  • we left the default which requiring clusters to be of size at least 5 (see help pages); smaller clusters will be ignored and samples assigned to them given the unassigned value of -1.
  • We also set isCount=TRUE to indicate that our input data are counts. This means that operations for clustering and visualizations will internally transform the data as \(log_2(x+1)\) (We could have alternatively explicitly set a transformation by giving a function to the transFun argument, for example if we wanted \(\sqrt(x)\) or \(log(x+\epsilon)\) or just function(x){x} for the identify).

    We can visualize these clusterings using the plotClusters command. We will also add some information about the samples saved in the colData slot (inherited from our original fluidigm object). We will choose the columns “Biological_Condition” and “Cluster2” from colData, which correspond to the original biological condition of the experiment, and the clusters reported in the original paper, respectively. We can give these names to the sampleData object, and it will retreive them from the colData. It is also useful to change the amount of space to allow for the labels of the clusterings, so we will reset the mar option in par (we also change axisLine argument for this reason).

    >>>>>>> release/0.1.0
    defaultMar<-par("mar")
    plotCMar<-c(1.1,8.1,4.1,1.1)
    par(mar=plotCMar)
    plotClusters(ce,main="Clusters from clusterMany", whichClusters="workflow", sampleData=c("Biological_Condition","Cluster2"), axisLine=-1)
    <<<<<<< HEAD

    =======

    >>>>>>> release/0.1.0

    This plot shows the samples in the columns, and different clusterings on the rows. Each sample is color coded based on its clustering for that row, where the colors have been choosen to try to match up clusters across different clusterings that show large overlap. Moreover, the samples have been ordered so that each subsequent clustering (starting at the top and going down) will try to order the samples to keep the clusters together, without rearranging the clustering blocks of the previous clustering/row.

    The labels shown are those given by clusterMany but can be a bit much for plotting. We can assign new labels if we prefer, for example to be more succinct, by changing the clusterLabels of the object (note we are permanently changing the labels here within the ClusterExperiment object). We choose to remove “Features” as being too wordy:

    cl<-clusterLabels(ce)
    cl<-gsub("Features","",cl)
    clusterLabels(ce)<-cl
    par(mar=plotCMar)
    plotClusters(ce,main="Clusters from clusterMany", whichClusters="workflow", sampleData=c("Biological_Condition","Cluster2"), axisLine=-1)

    We can see that some clusters are fairly stable across different choices of dimensions while others can vary dramatically. It is also clear that with higher dimensions, removing negative silhouette values results in more samples being removed – a single cutoff of silhouette distance is not robust across the dimension of the clustering routine.

    We have set whichClusters="workflow" to only plot clusters created from the workflow. Right now that’s all there are anyway, but as commands get rerun with different options, other clusterings can build up in the object (see discussion in this section about how multiple calls to workflow are stored). So setting whichClusters="workflow" means that we are only going to see our most recent calls to the workflow. There are many different options for how to run plotClusters discussed in in the detailed section on plotClusters, but for now, this plot is good enough for a quick visualization.

    2.3.1 The output

    The output of clusterMany is a ClusterExperiment object. This is a class built for this package and explained in the section on ClusterExperiment Objects. In the object ce the clusters are stored, names and colors for each cluster within a clustering are assigned, and other information about the clustersins. Furthermore, all of the information in the original SummarizedExperiment are retained.

    There are many accessor functions that help you get at the information in a ClusterExperiment object and some of the most relevant are described in the section on ClusterExperiment Objects. (ClusterExperiment objects are S4 objects, and are not lists).

    For right now we will only mention the most basic such function that retrieves the actual cluster assignments. This is the clusterMatrix function, that returns a matrix where the columns are the different clusterings and the rows are samples. Within a clustering, the clusters are encoded by consecutive integers.

    head(clusterMatrix(ce)[,1:3])
    <<<<<<< HEAD
    ##      nVarFeatures=100,k=5 nVarFeatures=500,k=5 nVarFeatures=1000,k=5
    ## [1,]                    1                    1                     1
    ## [2,]                    2                    1                     1
    ## [3,]                    3                    2                     2
    ## [4,]                    1                    1                     1
    ## [5,]                    3                    2                     2
    ## [6,]                    4                   -1                     2

    Notice that some of the samples are assigned the value of -1. -1 has the significance of encoding samples that are not assigned to any cluster. Why this is the case depends on the underlying choices of the clustering routine. In this case, we chose removeSil=TRUE so that samples with negative silhouette distance from their originally assigned clusters are discarded.

    =======
    ##      nVar=100,k=5 nVar=500,k=5 nVar=1000,k=5
    ## [1,]            1            1             1
    ## [2,]            1           -1            -1
    ## [3,]            2            2             2
    ## [4,]            3            3             1
    ## [5,]            4            4             2
    ## [6,]            4            2             2

    Because we have changed clusterLabels above, these new cluster labels are shown here. Notice that some of the samples are assigned the value of -1. -1 has the significance of encoding samples that are not assigned to any cluster. Why this is the case depends on the underlying choices of the clustering routine. In this case, we chose removeSil=TRUE so that samples with negative silhouette distance from their originally assigned clusters are discarded. Furthermore, the default in clusterMany set the minimum size of a cluster to be 5, which can also result in -1 assignments.

    >>>>>>> release/0.1.0

    (Another special value is -2 discussed in the section on ClusterExperiment objects

    2.4 Step 2: Find a consensus with combineMany

    To find a consensus cluster across the many different clusterings created by clusterMany the function combineMany can be used next.

    ce<-combineMany(ce)
    ## Warning in combineMany(ce): no clusters specified to combine, using results
    ## from clusterMany

    Notice we get a warning that we did not specify any clusters to combine, so it is using the default – those with clusterTypes="clusterMany" that we got in the previous step of the workflow.

    If we look at the clusterMatrix of the returned ce object, we see that the new cluster from combineMany has been added to the existing clusterings. This is the basic strategy of the functions in this package. Any clustering that is created is added to existing clusterings, so the user does not need to keep track of past clusterings and can easily compare what has changed.

    dim(clusterMatrix(ce))
    ## [1] 65 37
    head(clusterMatrix(ce)[,1:3])
    <<<<<<< HEAD
    ##      combineMany nVarFeatures=100,k=5 nVarFeatures=500,k=5
    ## [1,]           1                    1                    1
    ## [2,]          -1                    2                    1
    ## [3,]          -1                    3                    2
    ## [4,]           1                    1                    1
    ## [5,]          -1                    3                    2
    ## [6,]          -1                    4                   -1
    par(mar=plotCMar)
    plotClusters(ce,whichClusters="workflow")

    The default result of combineMany is not usually a great choice, and certainly isn’t helpful here. The resulting clusters leave most samples unassigned. This is because the default way of combining is very conservative – it require samples to only be in the same cluster if they are in the same clusters in every clustering, and makes clusters that satisfy this requirement. This is quite stringent. We can vary this by setting proportion argument to indicate the minimum proportion of times they should be together with other samples in the cluster they are assigned to. Explicit details on how this is done will be discussed in the section on combineMany.

    ce<-combineMany(ce,proportion=0.7)
    =======
    ##      combineMany nVar=100,k=5 nVar=500,k=5
    ## [1,]          -1            1            1
    ## [2,]          -1            1           -1
    ## [3,]          -1            2            2
    ## [4,]          -1            3            3
    ## [5,]          -1            4            4
    ## [6,]          -1            4            2
    par(mar=plotCMar)
    plotClusters(ce,whichClusters="workflow")

    The default result of combineMany is not usually a great choice, and certainly isn’t helpful here. The resulting clusters leave most samples unassigned (white in the above plot). This is because the default way of combining is very conservative – it require samples to only be in the same cluster if they are in the same clusters in every clustering, and makes clusters that satisfy this requirement. This is quite stringent. We can vary this by setting proportion argument to indicate the minimum proportion of times they should be together with other samples in the cluster they are assigned to. Explicit details on how combineMany makes these clusters is discussed in the section on combineMany.

    So let’s label the one we found “combineMany, default” and then create a new one. (Making a label will make it easier to keep track later).

    wh<-which(clusterLabels(ce)=="combineMany")
    if(length(wh)!=1) stop() else clusterLabels(ce)[wh]<-"combineMany,default"
    ce<-combineMany(ce,proportion=0.7)
    >>>>>>> release/0.1.0
    ## Warning in combineMany(ce, proportion = 0.7): no clusters specified to
    ## combine, using results from clusterMany
    par(mar=plotCMar)
    plotClusters(ce,whichClusters="workflow")
    <<<<<<< HEAD

    We can also visualize the proportion of times these clusters were together across these clusterings (this information was made and stored in the ClusterExperiment object when we called combineMany as long as proportion value is <1):

    plotCoClustering(ce)

    =======

    We see that more clusters are detected. Those that are still not assigned a cluster from combineMany clearly vary across the clusterings as to whether the samples are clustered together or not. Varying the proportion argument will adjust whether some of the unclustered samples get added to a cluster. We also note that there is also a default of minSize=5 for this command. We could reduce that requirement as well and more of the unsampled clusters would be grouped into a clustering. Again, we’ll label our old one “combineMany,0.7” before creating a new one.

    wh<-which(clusterLabels(ce)=="combineMany")
    if(length(wh)!=1) stop() else clusterLabels(ce)[wh]<-"combineMany,0.7"
    ce<-combineMany(ce,proportion=0.7,minSize=3)
    ## Warning in combineMany(ce, proportion = 0.7, minSize = 3): no clusters
    ## specified to combine, using results from clusterMany
    par(mar=plotCMar)
    plotClusters(ce,whichClusters="workflow",main="Min. Size=3")

    We can also visualize the proportion of times these clusters were together across these clusterings (this information was made and stored in the ClusterExperiment object when we called combineMany as long as proportion value is <1):

    plotCoClustering(ce)

    This visualization can help in determining whether to change the value of proportion (though see combineMany for how -1 assignments affect combineMany)

    >>>>>>> release/0.1.0

    2.5 Step 3: Merge clusters together with makeDendrogram and mergeClusters

    It is not uncommon in practice to create forty or more clusterings with clusterMany. In which case, the results of combineMany can often still result in too many small clusters. We might wonder if they are necessary or could be logically combined together. We could change the value of proportion in our call to combineMany. But we have found that it is often after looking at the clusters and how different they look on individual genes that we best make this determination, rather than the proportion of times they are together in different clustering routines.

    For this reason, we often find the need for an additional clustering step that merges clusters together that are not different, based on running tests of differential expression between the clusters found in combineMany. We often display and use both sets of clusterings side-by-side (that from combineMany and that from mergeClusters)

    mergeClusters needs a hierarchical clustering of the clusters. With this hiearchy it then goes progressively up that hierarchy, deciding whether two adjacent clusters can be merged. The function makeDendrogram makes such a hierarchy between clusters (by applying hclust to the mediods of the clusters). Because the results of mergeClusters are so dependent on that hierarchy, we require the user to call makeDendrogram rather than calling it internally. This is because different options in makeDendrogram can affect how the clusters are hierarchically clustered, and we want to encourage the user make these choices.

    ce<-makeDendrogram(ce,dimReduce="mostVar",ndims=500)
    plotDendrogram(ce)
    <<<<<<< HEAD

    We can see that clusters 1 and 4 are most closely related, at least in the top 500 most variable genes. We now run mergeClusters that will go up this hierarchy and compare the level of DE in each pair. In otherwords, DE tests are run between 1 and 4, and between 2 and 3. If there is not enough DE between each of these (based on a cutoff that can be set by the user), then clusters 1 and 4 will be merged and/or 2 and 3. And so on up the tree.

    It is useful to first run mergeClusters without actually creating any merged clusters so as to preview what the final clustering will be (and perhaps to help in setting the cutoff).

    mergeClusters(ce,mergeMethod="adjP",plot="mergeMethod")

    Then we can decide on a cutoff:

    ce<-mergeClusters(ce,mergeMethod="adjP",cutoff=0.05)
    par(mar=plotCMar)
    plotClusters(ce,whichClusters="workflow")

    Notice that two sets of clusters were merged into a larger cluster.

    Finally, we can do a heatmap visualizing our this final step of clustering.

    plotHeatmap(ce,clusterSamplesData="dendrogramValue",breaks=.99, sampleData=c("Biological_Condition", "Cluster1", "Cluster2"))

    By choosing “dendrogramValue” for the clustering of the samples, we will be showing the clusters according to the hierarchical ordering of the clusters found by makeDendrogram. breaks=0.99 means that the last color of the heatmap spectrum will be forced to be the top 1% of the data (rather than evenly spaced through the entire range of values). This can be helpful in making sure that rare extreme values in the upper range do not absorb too much space in the color spectrum. There are many more options for plotHeatmap, some of which are discussed in the section on plotHeatmap.

    =======

    We can see that clusters 4 and 5 are most closely related, at least in the top 500 most variable genes. We now run mergeClusters that will go up this hierarchy and compare the level of DE in each pair. In otherwords, DE tests are run between 4 and 5, and between 1 and 7. If there is not enough DE between each of these (based on a cutoff that can be set by the user), then clusters 4 and 5 will be merged and/or 1 and 7. And so on up the tree.

    It is useful to first run mergeClusters without actually creating any merged clusters so as to preview what the final clustering will be (and perhaps to help in setting the cutoff).

    mergeClusters(ce,mergeMethod="adjP",plot="mergeMethod")

    Then we can decide on a cutoff and visualize the resulting clustering:

    ce<-mergeClusters(ce,mergeMethod="adjP",cutoff=0.01)
    par(mar=plotCMar)
    plotClusters(ce,whichClusters="workflow", sampleData=c("Biological_Condition","Cluster2"))
    plotCoClustering(ce,whichClusters=c("mergeClusters","combineMany"))

    Notice that mergeClusters combines clusters based on the actual values of the features, while the coClustering plot is how often the samples clustered together. It is not uncommon that mergeClusters will merge clusters that don’t look “close” on the coClustering plot. This can be due to just the choices of the hierarchical clustering, but can also be because the two merged clusters are not often confused for each other across the clustering algorithms, yet don’t have strong differences on individual genes. This can be the case especially when the clustering is done on reduced PCA space, where an accumulation of small differences might consistently separate the samples (so the two clusters are not “confused” as to the samples), but because the differences are not strong on individual genes, mergeClusters combines them. These are ultimately different criteria.

    Finally, we can do a heatmap visualizing our this final step of clustering.

    plotHeatmap(ce,clusterSamplesData="dendrogramValue",breaks=.99, sampleData=c("Biological_Condition", "Cluster1", "Cluster2"))

    By choosing “dendrogramValue” for the clustering of the samples, we will be showing the clusters according to the hierarchical ordering of the clusters found by makeDendrogram. The argument breaks=0.99 means that the last color of the heatmap spectrum will be forced to be the top 1% of the data (rather than evenly spaced through the entire range of values). This can be helpful in making sure that rare extreme values in the upper range do not absorb too much space in the color spectrum. There are many more options for plotHeatmap, some of which are discussed in the section on plotHeatmap.

    >>>>>>> release/0.1.0

    2.6 Step 4: Finding Features related to the clusters

    The last step is to then find features that are different between these clusters, as a start to understanding biological differences in the samples. The function getBestFeatures performs tests for differential expression (i.e. different mean levels) between the clusters for each feature. It relies on limma to run the differential expression analysis, with voom correction if the data are indicated by the user to be counts.

    There are several types of tests that can be performed to identify features that are different between the clusters. Here we perform all pairwise tests between the clusters.

    pairsAll<-getBestFeatures(ce,contrastType="Pairs",p.value=0.05,number=nrow(ce),isCount=TRUE)
    ## Note: If `isCount=TRUE` the data will be transformed with voom() rather than
    ## with the transformation function in the slot `transformation`.
    ## This makes sense only for counts.
    ## Error in topTableF(fit2, genelist = rownames(fit$coef), ...): unused argument (type = "Pairs")
    head(pairsAll)
    <<<<<<< HEAD
    ##   IndexInOriginal Contrast Feature     logFC  AveExpr          t
    ## 1           12316    X1-X2 ZFP36L1 -7.931762 2.525658 -12.666798
    ## 2            4514    X1-X2   HMGB2 -8.837433 3.837891 -10.805657
    ## 3           11260    X1-X2   TOP2A -8.519551 3.946404  -9.761751
    ## 4            7210    X1-X2  NUSAP1 -8.574488 3.328983  -9.693685
    ## 5            5817    X1-X2  MAD2L1 -7.056108 1.296928  -9.426996
    ## 6           12346    X1-X2    ZIC2 -7.468349 1.807346  -9.372529
    ##        P.Value    adj.P.Val        B
    ## 1 4.794367e-20 6.179939e-16 33.60110
    ## 2 9.354477e-17 6.028961e-13 26.70191
    ## 3 7.632520e-15 3.279439e-11 22.59965
    ## 4 1.019987e-14 3.286908e-11 22.45235
    ## 5 3.185763e-14 8.212896e-11 21.51072
    ## 6 4.022098e-14 8.640807e-11 21.28419

    We can visualize only these significantly different pair-wise features with plotHeatmap by using the column “IndexInOriginal” in the result of getBestFeatures to quickly identify the genes to be used in the heatmap. Notice that the same genes can be replicated across different contrasts, so we will not always have unique genes:

    length(pairsAll$Feature)==length(unique(pairsAll$Feature))
    ## [1] TRUE

    In this case they are unique. But to be careful (and for future use of this code), we will make sure we take only unique gene values so that they are not plotted multiple times in our heatmap.

    plotHeatmap(ce, clusterSamplesData="dendrogramValue",clusterFeaturesData=unique(pairsAll[,"IndexInOriginal"]),main="Heatmap of features w/ significant pairwise differences",breaks=.99)
    ## Error in .local(data, ...): invalid indices for clusterFeaturesData
    =======
    ##   IndexInOriginal Contrast Feature      logFC    AveExpr         t
    ## 1            6156    X1-X2   TOP2A -11.479314 4.71463425 -8.480137
    ## 2            3339    X1-X2   MKI67 -10.364746 1.86763588 -8.323743
    ## 3            1802    X1-X2   ESCO2  -9.900535 0.04592192 -7.688447
    ## 4            2499    X1-X2   HMGB2  -9.759178 4.66975444 -7.139022
    ## 5            1054    X1-X2   CLSPN  -9.596555 2.62408690 -6.756329
    ## 6             915    X1-X2   CENPF -10.066812 4.75415121 -6.528264
    ##        P.Value    adj.P.Val         B
    ## 1 6.463170e-13 5.229247e-10 17.339294
    ## 2 1.333963e-12 1.009906e-09 16.246562
    ## 3 2.500029e-11 1.387320e-08 13.637250
    ## 4 3.067088e-10 1.140634e-07 11.674828
    ## 5 1.718258e-09 5.130090e-07 10.207339
    ## 6 4.743007e-09 1.220173e-06  9.472463

    We can visualize only these significantly different pair-wise features with plotHeatmap by using the column “IndexInOriginal” in the result of getBestFeatures to quickly identify the genes to be used in the heatmap. Notice that the same genes can be replicated across different contrasts, so we will not always have unique genes:

    length(pairsAll$Feature)==length(unique(pairsAll$Feature))
    ## [1] FALSE

    In this case they are not unique. Hence, we will make sure we take only unique gene values so that they are not plotted multiple times in our heatmap. (This is a good practice even if in a particular case the genes are unique).

    plotHeatmap(ce, clusterSamplesData="dendrogramValue",clusterFeaturesData=unique(pairsAll[,"IndexInOriginal"]),main="Heatmap of features w/ significant pairwise differences",breaks=.99)

    >>>>>>> release/0.1.0

    Notice that the samples clustered into the -1 cluster (i.e. not assigned) are clustered as an outgroup. They can also be mixed into the dendrogram (see on makeDendrogram)

    3 ClusterExperiment Objects

    The ce object that we created from calling clusterMany is a ClusterExperiment object. The ClusterExperiment class is used by this package to keep the data and the clusterings together. It inherits from SummarizedExperiment, which means the data and colData and other information orginally in the fluidigm object is retained and can be accessed with the same functions as before. The ClusterExperiment object additionally stores clusterings and information about the clusterings along side the data. This helps keep everything together, and like the SummarizedExperiment object, allows for simple things like subsetting to a reduced set of subjects and being confident that the corresponding clusterings, colData, and so forth are similarly subset.

    Typing the name at the control prompt results in a quick summary of the object.

    ce
    ## class: ClusterExperiment 
    ## dim: 7066 65 
    ## Primary cluster type: mergeClusters 
    ## Primary cluster label: mergeClusters 
    ## Table of clusters (of primary clustering):
    <<<<<<< HEAD
    ## -1 m1 m2 m3 
    ## 21 19 16  9 
    ## Total number of clusterings: 39 
    =======
    ## -1 m1 m2 m3 m4 m5 m6 
    ##  5 12  4  7 15 19  3 
    ## Total number of clusterings: 40 
    >>>>>>> release/0.1.0
    ## clusterMany run? Yes 
    ## combineMany run? Yes 
    ## makeDendrogram run? No 
    ## mergeClusters run? Yes

    This summary tells us the total number of clusterings (40), and gives some indication as to what parts of the standard workflow have been completed and stored in this object. It also gives information regarding the primaryCluster of the object. The primaryCluster is just one of the clusters that has been choosen to be the “primary” cluster, meaning that by default various functions will turn to this clustering as the desired cluster to use. The “primaryCluster” can be reset by the user (see primaryClusterIndex). clusterMany arbitrarily sets the ‘primaryCluster’ to the first one, and each later step of the workflow sets the primary index to the most recent, but the user set a specific clustering to be the primaryCluster with primaryClusterIndex.

    There are also additional commands to access the clusterings and their related information (type help("ClusterExperiment-methods") for more).

    The cluster assignments are stored in the clusterMatrix slot of ce, with samples on the rows and different clusterings on the columns. We can look at the cluster matrix and the primary cluster with the commands clusterMatrix and primaryCluster

    head(clusterMatrix(ce))[,1:5]
    <<<<<<< HEAD
    ##      mergeClusters combineMany combineMany.1 nVarFeatures.100.k.5
    ## [1,]             1           1             1                    1
    ## [2,]             1           1            -1                    2
    ## [3,]            -1          -1            -1                    3
    ## [4,]             1           1             1                    1
    ## [5,]             2           2            -1                    3
    ## [6,]            -1          -1            -1                    4
    ##      nVarFeatures.500.k.5
    ## [1,]                    1
    ## [2,]                    1
    ## [3,]                    2
    ## [4,]                    1
    ## [5,]                    2
    ## [6,]                   -1
    primaryCluster(ce)
    ##  [1]  1  1 -1  1  2 -1 -1  2  2  2  3  3  1 -1 -1  1 -1  2 -1 -1  3  3  3
    ## [24]  2  1  2  2  2 -1 -1 -1  1  1 -1 -1  1  1  3 -1  1 -1  2  2  1  2  2
    ## [47]  1 -1 -1 -1  3 -1  2  3 -1 -1  1  2  2  1  1  1  3  1  1
    =======
    ##      mergeClusters combineMany combineMany,0.7 combineMany,default
    ## [1,]             1           1               1                  -1
    ## [2,]            -1          -1              -1                  -1
    ## [3,]             2           2              -1                  -1
    ## [4,]             1           3               2                  -1
    ## [5,]            -1          -1              -1                  -1
    ## [6,]            -1          -1              -1                  -1
    ##      nVar=100,k=5
    ## [1,]            1
    ## [2,]            1
    ## [3,]            2
    ## [4,]            3
    ## [5,]            4
    ## [6,]            4
    primaryCluster(ce)
    ##  [1]  1 -1  2  1 -1 -1  3  4  4  4  5  5  1  5  6  1  6  4  3  3  5  5  5
    ## [24]  4 -1  4  4  4  5  5  5  1  5  3  3  5  6  5  2  5  3  4  4  1  4  4
    ## [47]  1  2  3  5  5  5  4  5  2  1  1  4  4  1  5  1  5  1 -1
    >>>>>>> release/0.1.0

    Notice how some extra clusterings have snuck in, like combineMany.1 because we made multiple calls to combineMany; only the last such call has the proper name combineMany and therefore will be shown when we use whichClusters="workflow" in our plotting (see this section for a discussion of how these repeated calls are handled.)

    Negative Valued clusters The different clusters are stored as consecutive integers, with ‘-1’ and ‘-2’ having special meaning. ‘-1’ refers to samples that were not clustered by the clustering algorithm. In our example, we removed samples with negative silhouette distance to their cluster, so they were assigned ‘-1’. ‘-2’ is for samples that were not included in the original input to the clustering. This is useful, if for example, you cluster on a subset of the samples, and then want to store this clustering with the clusterings done on all the data. You can create a vector of clusterings that give ‘-2’ to the samples not originally used and then add these clusterings to the ce object manually with addClusters.

    clusterLabels gives the column names of the clusterMatrix; clusterMany has given column names based on the parameter choices, and later steps in the workflow also give a name. But clusterings might also have no specific label if the user created them.

    head(clusterLabels(ce),10)
    ##  [1] "mergeClusters"       "combineMany"         "combineMany,0.7"    
    ##  [4] "combineMany,default" "nVar=100,k=5"        "nVar=500,k=5"       
    ##  [7] "nVar=1000,k=5"       "nPCA=5,k=5"          "nPCA=15,k=5"        
    ## [10] "nPCA=50,k=5"

    clusterTypes on the otherhand indicates what call made the clustering.

    head(clusterTypes(ce),10)
    ##  [1] "mergeClusters" "combineMany"   "combineMany.2" "combineMany.1"
    ##  [5] "clusterMany"   "clusterMany"   "clusterMany"   "clusterMany"  
    ##  [9] "clusterMany"   "clusterMany"

    The information that was in the original fluidigm object has also been preserved, like colData that contains information on each sample.

    colData(ce)[,1:5]
    ## DataFrame with 65 rows and 5 columns
    ##               NREADS  NALIGNED    RALIGN TOTAL_DUP    PRIMER
    ##            <numeric> <numeric> <numeric> <numeric> <numeric>
    ## SRR1275356  10554900   7555880   71.5862   58.4931 0.0217638
    ## SRR1275251   8524470   5858130   68.7213   65.0428 0.0351827
    ## SRR1275287   7229920   5891540   81.4884   49.7609 0.0208685
    ## SRR1275364   5403640   4482910   82.9609   66.5788 0.0298284
    ## SRR1275269  10729700   7806230   72.7536   50.4285 0.0204349
    ## ...              ...       ...       ...       ...       ...
    ## SRR1275259   5949930   4181040   70.2705   52.5975 0.0205253
    ## SRR1275253  10319900   7458710   72.2747   54.9637 0.0205342
    ## SRR1275285   5300270   4276650   80.6873   41.6394 0.0227383
    ## SRR1275366   7701320   6373600   82.7600   68.9431 0.0266275
    ## SRR1275261  13425000   9554960   71.1727   62.0001 0.0200522

    Another important slot in the ClusterExperiment object is the clusterLegend slot. This consists of a list, one element per column or clustering of clusterMatrix.

    length(clusterLegend(ce))
    ## [1] 40
    clusterLegend(ce)[1:2]
    ## $mergeClusters
    ##    clusterIds color     name
    ## 1  "1"        "#1F78B4" "m1"
    ## -1 "-1"       "white"   "-1"
    ## 2  "2"        "#33A02C" "m2"
    ## 3  "3"        "#E31A1C" "m3"
    <<<<<<< HEAD
    =======
    ## 4  "4"        "#FF7F00" "m4"
    ## 5  "5"        "#6A3D9A" "m5"
    ## 6  "6"        "#B15928" "m6"
    >>>>>>> release/0.1.0
    ## 
    ## $combineMany
    ##    clusterIds color     name
    ## 1  "1"        "#1F78B4" "c1"
    ## -1 "-1"       "white"   "-1"
    ## 2  "2"        "#33A02C" "c2"
    ## 3  "3"        "#E31A1C" "c3"
    <<<<<<< HEAD
    ## 
    ## $combineMany.1
    ##    clusterIds color     name
    ## 1  "1"        "#1F78B4" "c1"
    ## -1 "-1"       "white"   "-1"
    ## 2  "2"        "#33A02C" "c3"
    ## 3  "3"        "#E31A1C" "c2"
    ======= ## 4 "4" "#FF7F00" "c4" ## 5 "5" "#6A3D9A" "c5" ## 6 "6" "#B15928" "c6" ## 7 "7" "#2ef4ca" "c7" ## 8 "8" "#bd18ea" "c8" >>>>>>> release/0.1.0

    We can see that each element of clusterLegend consists of a matrix, with number of rows equal to the number of clusters in the clustering. The columns store information about that cluster. clusterIds is the internal id used in clusterMatrix to identify the cluster, name is a name for the cluster, and color is a color for that cluster. color is used in plotting and visualizing the clusters, and name is an arbitrary character string for a cluster. They are automatically given default values when the ClusterExperiment object is created, but we will see under the description of visualization methods how the user might want to manipulate these for better plotting results.

    4 Visualizing the data

    4.1 Plotting the clusters

    We demonstrated during the quick start that we can visualize multiple clusterings using the plotClusters command.

    par(mar=plotCMar)
    plotClusters(ce,main="Clusters from clusterMany",whichClusters="workflow",axisLine=-1)
    <<<<<<< HEAD

    We can assign new labels if we prefer, for example to be more succinct by changing the clusterLabels of the object (note we are permanently changing the labels here within the ClusterExperiment object)

    cl<-clusterLabels(ce)
    cl<-gsub("Features","",cl)
    clusterLabels(ce)<-cl
    par(mar=plotCMar)
    plotClusters(ce,main="Clusters from clusterMany",axisLine=-1)

    =======

    >>>>>>> release/0.1.0

    We can get very different plots depending on how we order the clusterings. The argument whichClusters allows you to give choose different clusters or provide an explicit ordering of the clusters. whichClusters can take either a single character value, or a vector of either characters or indices. If whichClusters matches either “all” or “workflow”, then the clusters chosen are either all clusters, or only clusters from the most recent calls to the workflow function. Choosing “workflow” removes from the visualization both user-defined clusters and also previous calls to the workflow that have since been rerun. Setting whichClusters="workflow" can be a useful if you have called a method like combineMany several times, as we did, only with different parameters. All of those runs are saved (unless eraseOld=TRUE), but you may not want to plot them.

    If whichClusters is a character that is not one of these designated values, the entries should match a cluster type (like clusterMany). Alternatively, the user can specify specific numeric indices corresponding to the columns of clusterMatrix that provide the order of the clusters.

    par(mar=plotCMar)
    plotClusters(ce,whichClusters="clusterMany",
                   main="Only Clusters from clusterMany",axisLine=-1)
    <<<<<<< HEAD

    =======

    >>>>>>> release/0.1.0

    We can also add to our plot (categorical) information on each of our subjects from the colData of our fluidigm object (which is also retained in our ClusterExperiment object). This can be helpful to see if the clusterings correspond to other features of the samples, such as sample batches. Here we add the values from the columns “Biological_Condition” and “Cluster2” that were present in the fluidigm object and given with the published data.

    par(mar=plotCMar)
    plotClusters(ce,whichClusters="workflow", sampleData=c("Biological_Condition","Cluster2"), 
                   main="Workflow clusters plus other data",axisLine=-1)
    <<<<<<< HEAD

    =======

    >>>>>>> release/0.1.0

    4.1.1 Saving the alignment of plotClusters

    plotClusters invisibly returns a ClusterExperiment object. In our earlier calls to plotCluster, this would be the same as the input object and so there is no reason to save it. However, the alignment and color assignments created by plotClusters can be requested to be saved via the resetNames, resetColors and resetOrderSamples arguments. If any of these are set to TRUE, then the object returned will be different than those of the input. Specifically, if resetColors=TRUE the colorLegend of the returned object will be changed so that the colors assigned to each cluster will be as were shown in the plot. Similarly, if resetNames=TRUE the names of the clusters will be changed to be integer values, but now the integers will be aligned to try to be the same across clusters (and therefore not consecutive integers, which is why these are saved as names for the clusters and not clusterIds). If resetOrderSamples=TRUE, then the order of the samples shown in the plot will be similarly saved in the slot orderSamples.

    Here we make a call to plotClusters, but now ask to reset everything to match this ordering. We’ll save this as a different object so that this is not a permanent change for the rest of the vignette.

    par(mar=plotCMar)
    ce_temp<-plotClusters(ce,whichClusters="workflow", sampleData=c("Biological_Condition","Cluster2"), 
                   main="Clusters from clusterMany, different order",axisLine=-1,resetNames=TRUE,resetColors=TRUE,resetOrderSamples=TRUE)
    <<<<<<< HEAD

    clusterLegend(ce)[c("mergeClusters","combineMany")]
    =======

    clusterLegend(ce_temp)[c("mergeClusters","combineMany")]
    >>>>>>> release/0.1.0
    ## $mergeClusters
    ##    clusterIds color     name
    ## 1  "1"        "#1F78B4" "1" 
    ## -1 "-1"       "white"   "-1"
    ## 2  "2"        "#33A02C" "2" 
    ## 3  "3"        "#E31A1C" "3" 
    <<<<<<< HEAD
    =======
    ## 4  "4"        "#FF7F00" "4" 
    ## 5  "5"        "#6A3D9A" "5" 
    ## 6  "6"        "#B15928" "6" 
    >>>>>>> release/0.1.0
    ## 
    ## $combineMany
    ##    clusterIds color     name
    ## 1  "1"        "#1F78B4" "1" 
    ## -1 "-1"       "white"   "-1"
    ## 2  "2"        "#33A02C" "2" 
    <<<<<<< HEAD
    ## 3  "3"        "#E31A1C" "3"

    Now, the clusterLegend slot of the object no longer has the default color/name assignments, but now has names and colors that match across the clusters.

    ======= ## 3 "3" "#bd18ea" "8" ## 4 "4" "#E31A1C" "3" ## 5 "5" "#FF7F00" "4" ## 6 "6" "#6A3D9A" "5" ## 7 "7" "#B15928" "6" ## 8 "8" "#A6CEE3" "9"

    Now, the clusterLegend slot of the object no longer has the default color/name assignments, but now has names and colors that match across the clusters. Notice, that this means the prefix “m” or “c” that was previously given to the distinguish the combineMany result from the mergeClusters result is now gone (the user could manually add them by changing the clusterLegend values).

    >>>>>>> release/0.1.0

    We can also force plotClusters to use the existing color definitions, rather than create its own. This makes particular sense if you want to have continuity between plots – i.e. be sure that a particular cluster always has a certain color – but would like to do different variations of plotClusters to get a sense of how similar the clusters are.

    For example, I set the colors above based on the cluster order from plotClusters where the clusters were ordered according to the workflow. But now I want to plot only the clusters from clusterMany, yet keep the same colors as before so I can compare them. I do this by setting the argument existingColors="all", meaning use all of the existing colors (currently this is the only option available for how to use the existing colors).

    par(mar=plotCMar)
    par(mfrow=c(1,2))
    plotClusters(ce_temp, sampleData=c("Biological_Condition","Cluster2"), whichClusters="workflow", existingColors="all",
                   main="Workflow Clusters, fix the color of clusters",axisLine=-1)
    
    plotClusters(ce_temp, sampleData=c("Biological_Condition","Cluster2"), existingColors="all",
                 whichClusters="clusterMany",
                   main="clusterMany Clusters, fix the color of clusters",axisLine=-1)
    <<<<<<< HEAD

    =======

    >>>>>>> release/0.1.0

    4.2 Heatmap with the clusters

    There is also a default heatmap command for a ClusterExperiment object that we used in the Quick Start. By default it clusters on the most variable features (after transforming the data) and shows the primaryCluster alongside the data. The primaryCluster now that we’ve run the workflow will be set as that from the mergeClusters step.

    <<<<<<< HEAD
    plotHeatmap(ce,main="Heatmap with clusterMany")

    =======
    par(mfrow=c(1,1))
    par(mar=defaultMar)
    plotHeatmap(ce,main="Heatmap with clusterMany")

    >>>>>>> release/0.1.0

    The plotHeatmap command has numerous options, in addition to those of aheatmap. plotHeatmap mainly provides additional functionality in the following areas:

    4.2.1 Displaying clustering or sample information

    Like plotClusters, plotHeatmap has a whichClusters option that behaves similarly to that of plotClusters. In addition to the options all and workflow that we saw with plotClusters, plotHeatmap also takes the option none (no clusters shown) and primary (only the primaryCluster). The user can also request a subset of the clusters by giving specific indices to whichClusters like in plotClusters.

    <<<<<<< HEAD

    Here we create a heatmap that shows the clusters from the workflow.

    plotHeatmap(ce,whichClusters="workflow",main="Heatmap with clusterMany, all clusters",annLegend=FALSE)
    ## Error in par(plt = gridPLT(), new = TRUE): invalid value specified for graphical parameter "plt"

    =======

    Here we create a heatmap that shows the clusters from the workflow. Notice that we choose only the last 2 – from combineMany and mergeClusters. If we chose all “workflow” clusters it would be too many.

    whClusterPlot<-1:2
    plotHeatmap(ce,whichClusters=whClusterPlot, annLegend=FALSE)

    >>>>>>> release/0.1.0

    Notice I also passed the option ‘annLegend=FALSE’ to the underlying aheatmap command (with many clusterings shown, it is often not useful to have a legend for all the clusters because the legend doesn’t fit on the page!). The many detailed commands of aheatmap that are not set internally by plotHeatmap can be passed along as well.

    Like plotClusters, plotHeatmap takes an argument sampleData, which refers to columns of the colData of that object and can be included.

    4.2.2 Additional options for clustering/ordering samples

    <<<<<<< HEAD

    I can choose to instead to not cluster the samples, but order them by cluster:

    plotHeatmap(ce,clusterSamplesData="primaryCluster",whichClusters="all",main="Heatmap with clusterMany",annLegend=FALSE)
    ## Error in plot.new(): figure margins too large
    ## Error in .Call.graphics(L_gridDirty): invalid graphics state

    As an improvement upone this, I can cluster my clusters into a dendrogram so that the most similar clusters will be near each other. I could do this by explicitly calling makeDendrogram (and therefore be able to control the choices in how it is done, which we will discuss below) which will store a dendrogram in our clusterExperiment object; alternatively plotHeatmap will call makeDendrogram with its default settings, which is what I will do here. I will also add some data about my samples from my colData slot using the sampleData argument.

    plotHeatmap(ce,clusterSamplesData="dendrogramValue",whichClusters="primaryCluster",main="Heatmap with clusterMany",sampleData=c("Biological_Condition","Cluster2"),annLegend=FALSE)

    =======

    I can choose to instead to not cluster the samples, but order them by cluster. This time I’ll just show the primary cluster (the mergeCluster result) by setting whichClusters="primaryCluster":

    plotHeatmap(ce,clusterSamplesData="primaryCluster",whichClusters="primaryCluster",main="Heatmap with clusterMany",annLegend=FALSE)

    As an improvement upone this, I can cluster my clusters into a dendrogram so that the most similar clusters will be near each other. I could do this by explicitly calling makeDendrogram (and therefore be able to control the choices in how it is done, which we will discuss below) which will store a dendrogram in our clusterExperiment object and plotHeatmap will pull from it; alternatively plotHeatmap will call makeDendrogram (based on the primary cluster) with its default settings, which is what I will do here. For visualization purposes, it’s better to have the dendrogram for the combineMany cluster rather than the mergeCluster cluster (since the mergeCluster is based on the dendrogram anyway). So I’m going to set the primary cluster index to 2.

    I will also add some data about my samples from my colData slot using the sampleData argument.

    primaryClusterIndex(ce)<-2
    plotHeatmap(ce,clusterSamplesData="dendrogramValue",whichClusters=whClusterPlot,main="Heatmap with clusterMany",sampleData=c("Biological_Condition","Cluster2"),annLegend=FALSE)

    >>>>>>> release/0.1.0

    4.2.3 Using separate input data for clustering and for visualization

    While count data is a common type of data, it is also common that the input data in the SummarizedExperiment object might be normalized data from a normalization package such as RUVSeq. In this case, the clustering and all numerical calculations should be done on the normalized data (which may or may not need a log transform). However, these normalized data might not be on a logical count scale (for example, in RUVSeq, the normalize data are residuals after subtracting out gene-specific batch effects).

    In this case, it can be convenient to have the visualization of the data (i.e. the color scale), be based on a count scale that is interpretable, even while the clustering is done based on the normalized data. This is possible by giving a new matrix of values to the argument visualizeData. In this case, the color scale (and clustering of the features) is based on the input visualizeData matrix, but all clustering of the samples is done on the internal data in the ClusterExperiment object.

    5 The clustering workflow

    We will now go into more detail about important options for the main parts of the clustering workflow.

    5.1 clusterMany

    5.1.1 Overview of the implemented clustering procedures

    In the quick start section we picked some simple and familiar clustering options that would run quickly and needed little explanation. However, our workflow generally assumes more complex options and more parameter variations are tried. Before getting into the specific options of clusterMany, let us first describe some of these more complicated setups, since many of the arguments of clusterMany depend on understanding them.

    Clustering Algorithms (clusterD): Clustering algorithms generally start with a particular predefined distance or dissimilarity between the samples, usually encoded in a \(n x n\) matrix, \(D\). In our package, we consider only such clustering algorithms. The input could also be similarities, though we will continue to call such a matrix \(D\).

    The simplest scenario is a simple calculation of \(D\) and then a clustering of \(D\), usually dependent on particular parameters. In this package we try to group together algorithms that cluster \(D\) based on common parameters and operations that can be done. Currently there are two “types” of algorithms we consider, which we call type “K” and “01”. The help files of clusterD document these choices more fully, but we give an overview here. Many of the parameters that are allowed to varied in clusterMany refer to parameters for the clustering of the \(D\) matrix, either for “K” or “01” type algorithms.

    The “K” algorithms are so called because their main parameter requirement is that the user specifies the number of clusters (\(K\)) to be created. They assume the input \(D\) are dissimilarities, and depending on the algorithm, may have additional expectations. “pam” and “kmeans” are examples of such types of algorithms.

    The “01” algorithms are so named because the algorithm assumes that the input \(D\) consists of similarities between samples and that the similarities encoded in \(D\) are on a scale of 0-1. They use this fact to make the primary user-specified parameter be not the number of final clusters, but a measure \(\alpha\) of how dissimilar samples in the same cluster can be (on a scale of 0-1). Given \(\alpha\), the algorithm then implements a method to then determine the clusters (so \(\alpha\) implicitly determines \(K\)). These methods rely on the assumption that because the 0-1 scale has special significance, the user will be able to make an determination more easily as to the level of dissimilarity allowed in a true cluster, rather than predetermine the number of clusters \(K\). The current 01 methods are “tight”, “hierarchical01” and “pam”.

    Subsampling In addition to the basic clustering algorithms on a matrix \(D\), we also implement many other common cluster processing steps that are relevant to the result of the clustering. We have already seen such an example with dimensionality reduction, where the imput \(D\) is determined based on different input data. A more significant processing that we perform is calculation of a dissimilarity matrix \(D\) not by a distance on the vector of data, but by subsampling and clustering of the subsampled data. The resulting \(D\) matrix is a matrix of co-clustering percentages. Each entry is a percentage of subsamples where the two samples shared a clustering over the many subsamplings of the data (there are slight variations as how this can be calculated, see help pages of subsampleClustering ). Note that this implies there actually two different clustering algorithms (and sets of corresponding parameters) – one for the clustering on the subsampled data, and one for the clustering of the resulting \(D\) of the percentage of coClustering of samples. clusterMany focuses on varying parameters related to the clustering of \(D\), and generally assumes that the underlying clustering on the subsampled data is simple (e.g. “pam”). (The underlying clustering machinery in our package is performed by a function clusterSingle which is not discussed in this tutorial, but can be called explicitly for fine-grained control over all of the features ). The subsampling option is computationally expensive, and when coupled with comparing many parameters, does result in a length evaluation of clusterMany. However, we recommend it as one of the most useful methods for getting stable clustering results.

    Sequential Detection of Clusters Another complicated addition to the clustering that requires additional explanation is the implementation of sequential clustering. This refers to clustering of the data, then removing the “best” cluster, and then re-clustering the remaining samples, and then continuing this iteration until all samples are clustered (or the algorithm in some other way calls a stop). Such sequential clustering can often be convenient when there is very dominant cluster, for example, that is far away from the other mass of data. Removing samples in these clusters and resampling can sometimes be more productive and result in results more robust to the choice of samples. A particular implementation of such a sequential method, based upon (Tseng and Wong 2005), is implemented in the clusterExperiment package. Because of the iterative nature of this process, there are many possible parameters (see help(seqCluster)). clusterMany does not allow variation of very many of these parameters, but instead just has the choice of running the sequential clustering or not. Sequential clustering can also be quite computationally expensive, particularly when paired with subsampling to determine \(D\) at each step of the iteration.

    5.1.2 Arguments of clusterMany

    Now that we’ve explained the underlying architecture of the clustering provided in the package, we discuss the parameters that can be varied in clusterMany. There are additional arguments available for clusterMany but right now we focus on only the ones that can be given multiple options. Recall that arguments in clusterMany that take on multiple values mean that the combinations of all the multiple valued arguments will be given as imput for a clustering routine.

    • sequential This parameter consists of logical values, TRUE and/or FALSE, indicating whether the sequential strategy should be implemented or not.
    • subsample This parameter consists of logical values, TRUE and/or FALSE, indicating whether the subsampling strategy for determining \(D\) should be implemented or not.
    • clusterFunction The clustering functions to be tried. If subsample=TRUE is part of the combination, then clusterFunction the method that will be used on the co-clustering matrix \(D\) created from subsampling the data (where pam clustering is used on the subsampled data). Otherwise, clusterFunction is the clustering method that will be used on dist(t(x))
    • ks The argument ‘ks’ is interpreted differently for different choices of the other parameters. If sequential=TRUE is part of the combination, ks defines the argument k0 of sequential clustering (see help(seqCluster)), which is approximately like the initial starting point for the number of clusters in the sequential process. Otherwise, ks is passed to set \(K\) of both the clustering of subsampled data and the actual clustering of the data (if a \(K\) needs to be set, i.e. a type “K” algorithm). When/if findBestK=TRUE is part of the combination, ks also defines the range of values to search for the best k (see the details in help(clusterMany) for more).
    • dimReduce These are character strings indicating what choices of dimensionality reduction should be tried. The choices are “PCA”, indicating clustering on the top principal components, “mostVar”, indicating clustering on the top most variable features, and “none”, indicating the whole data set should be used. If either “PCA” or “mostVar” are chosen, the following parameters indicate the number of such features to be used (and can be a vector of values as we have seen):
      • nVarDims
      • nPCADims
    • alphas These are the \(\alpha\) parameters for “01” clustering techniques; these values are only relevant if one of the clusterFunction values is a “01” clustering algorithm. The values given to alphas should be between 0 and 1, with smaller values indicating greater similarity required between the clusters.
    • findBestK This option is for “K” clustering techniques, and indicates that \(K\) should be choosen automatically as the \(K\) that gives the largest silhouette distance between clusters.
    • removeSil A logical value as to whether samples with small silhouette distance to their assigned cluster are “removed”, in the sense that they are not given their original cluster assignment but instead assigned -1. This option is for “K” clustering techniques as a method of removing poorly clustered samples (the “01” techniques used by clusterMany generally do this intrinsically as part of the algorithm).
    • silCutoff If removeSil is TRUE, then silCutoff determines the cutoff on silhouette distance for unassigning the sample.

    clusterMany tries to have generally simple interface, and for this reason makes choices about what is meant by certain combinations. For example, in combinations where findBestK=TRUE, ks=2:10 is taken to mean that the clustering should find the best \(k\) out of the range of 2-10. However, in other combinations ks might indicate the specific number of clusters, \(k\), that should be found. For parameter combinations that are not what is desired, the user should consider making direct calls to clusterSingle where all of these options (and many more) can be explicitly called.

    Other parameters for the clustering are kept fixed. As described above, there are many more possible parameters in play than are considered in clusterMany. These parameters can be set via the arguments clusterDArgs, subsampleArgs and seqArgs. These arguments correspond to the different processes described above (the clustering of \(D\), the creation of \(D\) via subsampling, and the sequential clustering process, respectively). These arguments take a list of arguments that are sent directly to clusterSingle. However, these arguments may be overridden by the interpretation of clusterMany of how different combinations interact; again for complete control direct calls to clusterSingle are necessary.

    5.1.3 Dealing with large numbers of clusterings

    A good first check before running clusterMany is to determine how many clusterings you are asking for. clusterMany has some limited internal checks to not do unnecessary duplicates (e.g. removeSil only works with some clusterFunctions so clusterMany would detect that), but generally takes all combinations. This can take a while for more complicated clustering techniques, so it is a good idea to check what you are getting into. You can do this by running clusterMany with run=FALSE

    In the following we consider expanding our original clustering choices to consider individual choices of \(K\) (rather than just findBestK=TRUE).

    checkParam<-clusterMany(se, clusterFunction="pam", ks=2:10,removeSil=c(TRUE,FALSE), isCount=TRUE,dimReduce=c("PCA","mostVar"),nVarDims=c(100,500,1000),nPCADims=c(5,15,50),run=FALSE)
    dim(checkParam$paramMatrix) #number of rows is the number of clusterings
    ## [1] 108  10

    Each row of the matrix checkParam$paramMatrix is a requested clustering (the columns indicate the value of a possible parameter). Our selections indicate 108 different clusterings (!).

    We can set ncores argument to have these clusterings done in parallel. If ncores>1, the parallelization is done via mclapply and should not be done in the Rgui interface (see help pages for mclapply).

    5.2 Create a unified cluster from many clusters with combineMany

    After creating many clusterings, combineMany finds a single cluster based on what samples were in the same clusters throughout the many clusters found by clusterMany. While subsampling the data helps be robust to outlying samples, combining across many clustering parameters can help be robust to choice in parameters, particularly then the parameters give roughly similar numbers of clusters.

    As mentioned in the Quick Start section, the default option for combineMany is to only define a cluster when all of the samples are in the same clusters across all clusterings. However, this is generally too conservative and just results in most samples not being assigned to a cluster.

    <<<<<<< HEAD

    Instead combineMany has a parameter proportion that governs in what proportion of clusterings the samples should be together. Internally, comineMany makes a coClustering matrix \(D\). Like the \(D\) created by subsampling in clusterMany, the coClustering matrix takes on values 0-1 for the proportion of times the samples are together in the clustering. This \(D\) matrix is saved in the ce object and can be visualized with plotCoClustering (which is just a call to plotHeatmap).

    ce<-combineMany(ce,proportion=0.7)
    ## Warning in combineMany(ce, proportion = 0.7): no clusters specified to
    ## combine, using results from clusterMany
    plotCoClustering(ce)

    We can also manually choose the set of clusters to use with the argument whichClusters. Here we choose only the clusters that correspond to using dimensionality reduction using the most variable features.

    wh<-grep("nVarFeatures",clusterLabels(ce))
    ce<-combineMany(ce,whichCluster=wh,proportion=0.7)
    plotCoClustering(ce)

    combineMany performs the clustering by running a “01” clustering algorithm on the \(D\) matrix of percentage co-clustering (the default being “hierarchical01”). The alpha argument to the 01 clustering is 1-proportion. Also passed to the clustering algorithm is the parameter minSize which sets the minimimum size of a cluster.

    =======

    Instead combineMany has a parameter proportion that governs in what proportion of clusterings the samples should be together. Internally, comineMany makes a coClustering matrix \(D\). Like the \(D\) created by subsampling in clusterMany, the coClustering matrix takes on values 0-1 for the proportion of times the samples are together in the clustering. This \(D\) matrix is saved in the ce object and can be visualized with plotCoClustering (which is just a call to plotHeatmap). Recall the one we last made in the QuickStart, with our call to combineMany (proportion=0.7 and minSize=3). Let’s label that one our ‘combineMany,final’ before we start playing around.

    wh<-which(clusterLabels(ce)=="combineMany")
    if(length(wh)!=1) stop() else clusterLabels(ce)[wh]<-"combineMany,final"
    plotCoClustering(ce)

    combineMany performs the clustering by running a “01” clustering algorithm on the \(D\) matrix of percentage co-clustering (the default being “hierarchical01”). The alpha argument to the 01 clustering is 1-proportion. Also passed to the clustering algorithm is the parameter minSize which sets the minimimum size of a cluster.

    We can also manually choose the set of clusters to use in combineMany with the argument whichClusters. Here we choose only the clusters that correspond to using dimensionality reduction using the most variable features. We also set minSize to be lower than the default of 5 to allow for smaller clusters

    wh<-grep("nVar",clusterLabels(ce))
    ce<-combineMany(ce,whichCluster=wh,proportion=0.7,minSize=3)
    wh<-which(clusterLabels(ce)=="combineMany")
    if(length(wh)!=1) stop() else clusterLabels(ce)[wh]<-"combineMany,nVar"
    plotCoClustering(ce)

    We can compare to all of our other versions of combineMany. While they do not all have clusterTypes equal to “combineMany” (only the most recent call has clusterType exactly equal to ), they all have “combineMany” as part of their clusterType, even though they have different clusterLabels (and now we’ll see that it was useful to give them different labels!)

    wh<-grep("combineMany",clusterTypes(ce))
    par(mar=plotCMar)
    plotClusters(ce,whichClusters=rev(wh),axisLine=-1)

    >>>>>>> release/0.1.0

    Treatment of Unclustered assignments -1 values are treated separately in the calculation. In particular, they are not considered in the calculation of percentage co-clustering – the percent co-clustering is taken only with respect to those clusterings where both samples were assigned. However, a post-processing is done to the clusters found from running the clustering on the \(D\) matrix. For each sample, the percentage of times that they were marked -1 in the clusterings is calculated. If this percentage is greater than the argument propUnassigned then the sample is marked as -1 in the clustering returned by combineMany.

    Good scenarios for running combineMany Varying certain parameters result in clusterings better for combineMany than other sets of parameters. In particular, if there are huge discrepancies in the set of clusterings given to combineMany, the results will be a shattering of the samples into many small clusters. Similarly, if the number of clusters \(K\) is very different, the end result will likely be like that of the large \(K\), and how much value that is (rather than just picking the clustering with the largest \(K\)), is debatable. However, for “01” clustering algorithms or clusterings using the sequential algorithm, varying the underlying parameters \(\alpha\) or \(k_0\) often results in roughly similar clusterings across the parameters so that creating a consensus across them is highly informative.

    5.3 Creating a Hierarchy of Clusters and Merging clusters

    As mentioned above, we find that merging clusters together based on the extent of differential expression between the features to be a useful method for combining many small clusters.

    We provide a method for doing this that consists of two steps. Making a hierarchy between the clusterings and then estimating the amount of differential expression at each branch of the hierarchy.

    5.3.1 makeDendrogram

    makeDendrogram creates a hierarchical clustering of the clusters as determined by the primaryCluster of the ClusterExperiment object. In addition to being used for merging clusters, the dendrograms created by makeDendrogram are also useful for ordering the clusters in plotHeatmap as has been shown above.

    makeDendrogam performs hierarchical clustering of the cluster mediods (after transformation of the data) and provides a dendrogram that will order the samples according to this clustering of the clusters. The hierarchical ordering of the dendrograms is saved internally in the ClusterExperiment object

    Like clustering, the dendrogram can depend on what features are included from the data. The same options for clustering are available for the hierarchical clustering of the clusters, namely choices of dimensionality reduction via dimReduce and the number of dimensions via ndims.

    ce<-makeDendrogram(ce,dimReduce="mostVar",ndims=500)
    plotDendrogram(ce)
    <<<<<<< HEAD

    Notice that the plot of the dendrogram shows the hierarchy of the clusters (and color codes them according to the colors stored in colorLegend slot).

    =======

    Notice that the plot of the dendrogram shows the hierarchy of the clusters (and color codes them according to the colors stored in colorLegend slot).

    >>>>>>> release/0.1.0

    If we look at the summary of ce, it now has ‘makeDendrogram’ marked as ‘Yes’.

    ce
    ## class: ClusterExperiment 
    ## dim: 7066 65 
    ## Primary cluster type: combineMany 
    ## Primary cluster label: combineMany,nVar 
    ## Table of clusters (of primary clustering):
    <<<<<<< HEAD
    ## -1 c1 c2 c3 
    ## 14 24 14 13 
    ## Total number of clusterings: 8 
    =======
    ## -1 c1 c2 c3 c4 c5 c6 c7 c8 
    ## 11  4  4  7  3  7 15 11  3 
    ## Total number of clusterings: 41 
    >>>>>>> release/0.1.0
    ## clusterMany run? Yes 
    ## combineMany run? Yes 
    ## makeDendrogram run? Yes 
    ## mergeClusters run? No

    Recall that the most recent run is of combineMany is based on experimenting with different parameters of combineMany, so is combining only the clusterings from clusterMany that use the top most variable genes. We might prefer to run mergeClusters based on running combineMany on all the clusterings of combineMany like in quick start (the combineMany, final).

    However, since we labeled them, we can also just set the primaryCluster to the one with that label (makeDendrogram always pulls from the primaryCluster)

    wh<-which(clusterLabels(ce)=="combineMany,final")
    if(length(wh)!=1) stop() else primaryClusterIndex(ce)<-wh
    ce<-makeDendrogram(ce,dimReduce="mostVar",ndims=500)
    plotDendrogram(ce)

    Note that the clusterType of this clustering is not “combineMany”, but “combineMany_x”, where “x” indicates what iteration it is. If we want it to show up in our “workflow” plots, we need to set its clusterType to be “combineMany” (and change the existing “combineMany” or delete it with removeCluster). This is rather complicated, and future versions of the package will probably provides some functionality to do this.

    Alternatively, we can just redo combineMany with the same parameters and then call makeDendrogram on that output (we’ll give it a new label to keep it separate):

    ce<-combineMany(ce,proportion=0.7,minSize=3)
    ## Warning in combineMany(ce, proportion = 0.7, minSize = 3): no clusters
    ## specified to combine, using results from clusterMany
    ce<-makeDendrogram(ce,dimReduce="mostVar",ndims=500)
    plotDendrogram(ce)

    if(length(wh)!=1) stop() else wh<-which(clusterLabels(ce)=="combineMany")
    clusterLabels(ce)[wh]<-"combineMany,finalV2"

    Dendrograms can get deleted! makeDendrogram operates on the primaryCluster and the dendrograms that are stored internally are required to match the primaryCluster. For this reason if the primaryCluster is changed (either because the user requests it or because some part of the workflow has changed the primaryCluster), then the dendrograms will be deleted in the object returned. Usually makeDendrogram does not take too long to recalculate, but if it is too slow the user’s needs, then it the user may wish to not copy over existing ce at certain steps and instead give the outputs new name (and be sure not to change the primaryCluster with primaryClusterIndex).

    5.3.2 Merging clusters with little differential expression

    We then can use this hierarchy of clusters to merge clusters that show little difference in expression. We do this by testing, for each node of the dendrogram, for which features is the mean of the set of clusters to the right split of the node is equal to the mean on the left split. This is done via the getBestFeatures (see section on getBestFeatures), where the type argument is set to “Dendro”.

    Starting at the bottom of the tree, those clusters that have the percentage of features with differential expression below a certain value (determined by the argument cutoff) are merged into a larger cluster. This testing of differences and merging continues until the estimated percentage of non-null DE features is above cutoff. This means lower values of cutoff result in less merging of clusters. There are multiple methods of estimation of the percentage of non-null features implemented. The option mergeMethod="adjP" which we showed earlier is the simplest: the proportion found significant by calculating the proportion of DE genes at a given False Discovery Rate thresshold (using the Benjamini-Hochberg procedure). However, other methods are also implemented (see the help of mergeClusters).

    mergeClusters can also be run without merging the cluster, and simply drawing a plot showing the dendrogram along with the estimates of the percentage of non-null features to aid in deciding a cutoff and method. By setting plotType="all", all of the estimates of the different methods are displayed simultaneously, while before in the QuickStart, we only showed the default values.

    mergeClusters(ce,mergeMethod="none",plotType="all")
    <<<<<<< HEAD

    =======
    ## Warning in locfdr::locfdr(tstats, plot = 0): f(z) misfit = 6.3. Rerun with
    ## increased df

    >>>>>>> release/0.1.0

    Now we can pick a cutoff.

    ce<-mergeClusters(ce,cutoff=0.05,mergeMethod="adjP")
    ce
    ## class: ClusterExperiment 
    ## dim: 7066 65 
    ## Primary cluster type: mergeClusters 
    ## Primary cluster label: mergeClusters 
    ## Table of clusters (of primary clustering):
    <<<<<<< HEAD
    ## -1 m1 m2 m3 
    ## 14 14 24 13 
    ## Total number of clusterings: 9 
    =======
    ## -1 m1 m2 
    ##  5 31 29 
    ## Total number of clusterings: 43 
    >>>>>>> release/0.1.0
    ## clusterMany run? Yes 
    ## combineMany run? Yes 
    ## makeDendrogram run? No 
    ## mergeClusters run? Yes

    Notice that the summary of ce now has ‘makeDendrogram run?’ as ‘No’. These seems odd, since we just ran it. However, running mergeClusters changed the primaryCluster identification, so the dendrogram we made before no longer matches the existing primaryCluster, so they are discarded. This means if the user wants a dendrogram downstream (e.g. for plotHeatmap) or to rerun mergeClusters then makeDendrogram must be rerun with the appropriate primaryCluster.

    For example, to rerun mergeClusters with a different cutoff, we need to reset the primaryCluster to be the last combineMany clustering and remake the dendrogram:

    wh<-which(clusterTypes(ce)=="combineMany")
    if(length(wh)!=1) stop() else primaryClusterIndex(ce)<-wh
    ce<-makeDendrogram(ce,dimReduce="mostVar",ndims=500)
    ce<-mergeClusters(ce,cutoff=0.15,mergeMethod="MB")
    ce
    ## class: ClusterExperiment 
    ## dim: 7066 65 
    ## Primary cluster type: mergeClusters 
    ## Primary cluster label: mergeClusters 
    ## Table of clusters (of primary clustering):
    <<<<<<< HEAD
    ## -1 m1 m2 m3 
    ## 14 14 24 13 
    ## Total number of clusterings: 10 
    =======
    ## -1 m1 m2 m3 m4 m5 m6 
    ##  5 12 19  7 15  3  4 
    ## Total number of clusterings: 44 
    >>>>>>> release/0.1.0
    ## clusterMany run? Yes 
    ## combineMany run? Yes 
    ## makeDendrogram run? No 
    ## mergeClusters run? Yes

    5.4 Keeping track of and rerunning elements of the workflow

    The commands we have shown above show a workflow which continually saves the results over the previous object, so that additional information just gets added to the existing object.

    What happens if some parts of the clustering workflow are re-run? For example, in the above we reran parts of the workflow when we talked about them in more detail, or to experiment with parameter settings.

    <<<<<<< HEAD

    The workflow commands check for existing clusters of the workflow (based on the clusterType of the clusterings). If there exist clusterings from previous runs and such clusterings came from calls that are “downstream” of the requested clustering, then the method will change their clusterType value by adding a “.i”, where \(i\) is a numerical index keeping track of replicate calls.

    For example, suppose I rerun ‘combineMany’, say with a different parameter choice of the proportion similarity to require. Then combineMany searches the existing clusterings in my imput object. Any existing combineMany results will have their clusterType changed from combineMany to combineMany.x, where \(x\) is the largest such number needed to be greater than any existing combineMany.x (after all, you might do this many times!). Moreover, since mergeClusters is downstream of combineMany in the workflow, currently existing mergeClusters will also get bumped to mergeClusters.x. However, clusterMany is upstream of combineMany (i.e. you expect there to be existing clusterMany before you run combineMany) so nothing will happen to clusterMany.

    This is handled internally, and may never be apparent to the user unless they choose whichClusters="all" in a plotting command. Indeed this is one reason to always pick whichClusters="workflow", so that these saved previous versions are not dispayed. If you really want to, you can see more about the existing iterations and where they are in the clusterMatrix with functions like workflowClusterDetails and workflowClusterTable.

    You can also choose to have all old versions erased by choosing the options eraseOld=TRUE in the call to clusterMany, combineMany and/or mergeClusters. eraseOld=TRUE will delete ALL past workflow results except for those that are both in the current workflow and “upstream” of the requested command.

    Note that there is nothing that governs or protects the clusterType values to be of a certain kind. This means that if the user decides to name a clusterType of a clustering one of these protected names, that is allowed.

    Designate a Final Clustering A final protected clusterType is “final”. This is not created by any method, but can be set as the clusterType of a clustering by the user (via the clusterType command). Any clustering marked final will be be considered one of the workflow values for commands like whichClusters="workflow". However, they will NOT be renamed with “.x” or removed if eraseOld=TRUE. This is a way for a user to ‘save’ a clustering as important/final so it will not be changed internally by any method, yet still have it show up with the “workflow” clustering results. There is no limit to the number of such clusters that are so marked, but the utility of doing so will drop if too many such clusters are chosen.

    =======

    The workflow commands check for existing clusters of the workflow (based on the clusterTypes of the clusterings). If there exist clusterings from previous runs and such clusterings came from calls that are “downstream” of the requested clustering, then the method will change their clusterTypes value by adding a “.i”, where \(i\) is a numerical index keeping track of replicate calls.

    For example, suppose I rerun ‘combineMany’, say with a different parameter choice of the proportion similarity to require. Then combineMany searches the existing clusterings in my imput object. Any existing combineMany results will have their clusterTypes changed from combineMany to combineMany.x, where \(x\) is the largest such number needed to be greater than any existing combineMany.x (after all, you might do this many times!). Moreover, since mergeClusters is downstream of combineMany in the workflow, currently existing mergeClusters will also get bumped to mergeClusters.x. However, clusterMany is upstream of combineMany (i.e. you expect there to be existing clusterMany before you run combineMany) so nothing will happen to clusterMany.

    This is handled internally, and may never be apparent to the user unless they choose whichClusters="all" in a plotting command. Indeed this is one reason to always pick whichClusters="workflow", so that these saved previous versions are not dispayed.

    Note that there is nothing that governs or protects the clusterTypes values to be of a certain kind. This means that if the user decides to name a clusterTypes of a clustering one of these protected names, that is allowed.

    Erasing old clusters You can also choose to have all old versions erased by choosing the options eraseOld=TRUE in the call to clusterMany, combineMany and/or mergeClusters. eraseOld=TRUE will delete ALL past workflow results except for those that are both in the current workflow and “upstream” of the requested command. You can also manually remove clusters with removeClusters.

    Finding workflow iterations Sometimes which numbered iteration a particular call is in will not be obvious if there are many calls to the workflow. You may have a mergeClusters.2 cluster but no mergeClusters.1 because of an upstream workflow call in the middle that bumped the iteration value up to 2 without ever making a mergeClusters.1. If you really want to, you can see more about the existing iterations and where they are in the clusterMatrix.

    workflowClusterTable(ce)
    ##                Iteration
    ## Type             0  1  2  3  4
    ##   final          0  0  0  0  0
    ##   mergeClusters  1  0  0  1  1
    ##   combineMany    1  1  1  1  1
    ##   clusterMany   36  0  0  0  0

    Explicit details about every workflow cluster and their index in clusterMatrix is given by workflowClusterDetails:

    head(workflowClusterDetails(ce),8)
    ##   index          type iteration               label
    ## 1     1 mergeClusters         0       mergeClusters
    ## 2     2 mergeClusters         4     mergeClusters.4
    ## 3     3   combineMany         0 combineMany,finalV2
    ## 4     4   combineMany         4    combineMany,nVar
    ## 5     5 mergeClusters         3     mergeClusters.3
    ## 6     6   combineMany         3   combineMany,final
    ## 7     7   combineMany         2     combineMany,0.7
    ## 8     8   combineMany         1 combineMany,default

    5.4.1 Designate a Final Clustering

    A final protected clusterTypes is “final”. This is not created by any method, but can be set as the clusterTypes of a clustering by the user (via the clusterTypes command). Any clustering marked final will be be considered one of the workflow values for commands like whichClusters="workflow". However, they will NOT be renamed with “.x” or removed if eraseOld=TRUE. This is a way for a user to ‘save’ a clustering as important/final so it will not be changed internally by any method, yet still have it show up with the “workflow” clustering results. There is no limit to the number of such clusters that are so marked, but the utility of doing so will drop if too many such clusters are chosen.

    >>>>>>> release/0.1.0

    For best functionality, particularly if a user has determined a single final clustering after completing clustering, a user will probably want to set the primaryClusterIndex to be that of the final cluster and rerun makeDendrogram. This will help in plotting and visualizing.

    Here we will go back to our previous mergeClusters that we found with cutoff=0.05 and mark it as our final clustering. First we need to find which cluster it is. We see from our above call to the workflow functions above, that it is mergeClusters.4 (we also see that if we had labelled these with clusterLabels, it would be much easier to track down!). Now we mark the “mergeClusters.4” version as the final one by changing it’s clusterTypes manually to “final”.

    whFinal<-which(clusterTypes(ce)=="mergeClusters.4")
    if(length(whFinal)!=1) stop() else clusterTypes(ce)[whFinal]<-"final"
    par(mar=plotCMar)
    plotClusters(ce,whichClusters="workflow")
    <<<<<<< HEAD

    =======

    Note that because it is labeled as final it shows up automatically in “workflow” clusters in our plotClusters command. If we are settled on this cluster being our primary focus of plots, dendrograms, etc., then we probably want to also set it as the primaryCluster as well as change it’s clusterLabel to be more meaningful for plots, etc. We also will make a dendrogram to go with it for later use (recall any dendrogram we had will be lost when we change the primaryCluster):

    clusterLabels(ce)[whFinal]<-"My Final Clustering"
    primaryClusterIndex(ce)<-whFinal
    ce<-makeDendrogram(ce,dimReduce="mostVar",ndims=500)

    This didn’t get rid of our undesired mergeClusters result that is most recent. It still shows up as “the”" mergeClusters result. This might be undesired. We could remove it with removeClusters or manually change the clusterTypes to mergeClusters.x so that it doesn’t show up as current. We can also change it’s label, without changing its type, so as to be more informative.

    >>>>>>> release/0.1.0

    6 Finding Features related to a Clustering

    The function getBestFeatures finds features in the data that are strongly differentiated between the clusters of a given clustering. Finding the best features is generally the last step in the workflow, once a final clustering has been decided upon, though as we have seen it is also called internally in mergeClusters to decide between which clusters to merge together.

    The function getBestFeatures calls limma on input data to determine the gene features most associated with a particular clustering. getBestFeatures picks the primaryCluster of a ClusterExperiment object as the clustering to use to find features. If the standard workflow is followed, this will be the last completed step (usually the result of mergeClusters). This can be changed by setting primaryClusterIndex to point to a different clustering.

    The basic implementation is that of limma which fits a linear model per feature and tests for the significance of parameters of that linear model. The main contribution of getBestFeatures is to interface with limma so as to pick appropriate parameters or tests for comparing clusters. Naturally, getBestFeatures also seamlessly works with ClusterExperiment objects to minimize the burden on the user. The output is in the form of topTable in limma, i.e. a data.frame giving the relevant features, the p-value, etc.

    6.1 Types of Significance Tests

    There are several choices of what is the most appropriate test to determine whether a feature is differentially expressed across the clusterings. All of these methods first fit a linear model where the clusters categories of the clustering is the explanatory factor in the model (samples with -1 or -2 are ignored). The methods differ only in what significance tests they then perform, which is controlled by the argument type. By default, getBestFeatures finds significant genes based on a F-test between the clusters (type="F"). This is a very standard test to compare clusters, which is why it is the default, however it may not be the one that gives the best or most specific results. Indeed, in our “Quick Start”, we did not do the \(F\) test, but rather all pair-wise comparisons between the clusters.

    The \(F\) test is a test for whether there are any differences in expression between the clusters for a feature. Three other options are available that try to detect instead specific kinds of differences between clusters that might be of greatest interest. Specifically, these differences are encoded as “contrasts”, meaning specific types of differences between the means of clusters.

    Note that for all of these contrasts, we are making use of all of the data, not just the samples in the particular cluster pairs being compared. This means the variance is estimated with all the samples. Indeed, the same linear model is being used for all of these comparisons.

    <<<<<<< HEAD

    Pairs: The option type="Pairs", which we saw earlier, performs all pair-wise tests between the clusters for each feature, testing for each pair of clusters whether the mean of the feature is different between the two clusters. Here is the example from above using all pairwise comparisons:

    pairsAllTop<-getBestFeatures(ce,type="Pairs",p.value=0.05)
    ## Error in topTableF(fit2, genelist = rownames(fit$coef), ...): unused argument (type = "Pairs")
    dim(pairsAllTop)
    ## Warning: restarting interrupted promise evaluation
    ## Error in eval(expr, envir, enclos): cannot open file '/Users/epurdom/Documents/Sequencing/SingleCell/clusterExperiment/vignettes/clusterExperimentTutorial_cache/getBestFeatures_onlyTopPairs_945df7fd6347ce2d786026ebe99582e1.rdb': No such file or directory
    head(pairsAllTop)
    ## Warning in head(pairsAllTop): restarting interrupted promise evaluation
    ## Error in head(pairsAllTop): cannot open file '/Users/epurdom/Documents/Sequencing/SingleCell/clusterExperiment/vignettes/clusterExperimentTutorial_cache/getBestFeatures_onlyTopPairs_945df7fd6347ce2d786026ebe99582e1.rdb': No such file or directory

    Notice that compared to the quick start guide, I didn’t set the parameter number which is passed to topTable, so I can get out at most 10 significant features for each contrast/comparison (because the default value of number in topTable is 10). Similarly, if I didn’t set a value for p.value, topTable would return the top number genes per contrast, regardless of whether they were all significant or not. These are the defaults of topTable, which we purposefully do not modify, but we urge the user to read the documentation of topTable carefully to understand what is being asked for. In the QuickStart, we set number=NROW(ce) to make sure we got all significant genes.

    In addition to the columns provided by topTable, the column Contrast tells us what pairwise contrast the result is from. X1-X2 means a comparison of cluster 1 and cluster 2. The column IndexInOriginal gives the index of the gene to the original input data matrix, namely assay(ce). The other columns are given by topTable (with the column Feature renamed – it is usually ProbeID in limma).

    OneAgainsAll: The choice type="OneAgainsAll" performs a comparison of a cluster against the mean of all of the other clusters.

    best1vsAll<-getBestFeatures(ce,type="OneAgainstAll",p.value=0.05)
    ## Error in topTableF(fit2, genelist = rownames(fit$coef), ...): unused argument (type = "OneAgainstAll")
    head(best1vsAll)
    ## Warning in head(best1vsAll): restarting interrupted promise evaluation
    ## Error in head(best1vsAll): cannot open file '/Users/epurdom/Documents/Sequencing/SingleCell/clusterExperiment/vignettes/clusterExperimentTutorial_cache/getBestFeatures_oneAgainstAll_b339d8a773cc9cbf64d16d5e1297aafa.rdb': No such file or directory
    =======

    6.1.1 All Pairwise:

    The option type="Pairs", which we saw earlier, performs all pair-wise tests between the clusters for each feature, testing for each pair of clusters whether the mean of the feature is different between the two clusters. Here is the example from above using all pairwise comparisons:

    pairsAllTop<-getBestFeatures(ce,contrastType="Pairs",p.value=0.05)
    dim(pairsAllTop)
    ## [1] 10  9
    head(pairsAllTop)
    ##   IndexInOriginal Contrast Feature     logFC  AveExpr          t
    ## 1            5744    X1-X2   STMN2  9.237481 8.478084  13.060756
    ## 2            2209    X1-X2    GLI3 -7.421220 5.573895 -10.743575
    ## 3            4284    X1-X2  PLXNA2  7.476341 5.276724   9.674490
    ## 4            5084    X1-X2    RTN1  7.195767 7.624160   9.384104
    ## 5             406    X1-X2   ATCAY  7.045734 4.951946   9.236713
    ## 6            5275    X1-X2   SFRP1 -7.375415 5.788717  -9.224464
    ##        P.Value    adj.P.Val        B
    ## 1 4.391031e-21 3.102703e-17 35.69468
    ## 2 6.924754e-17 2.446515e-13 27.02199
    ## 3 7.126175e-15 1.678452e-11 22.79508
    ## 4 2.541467e-14 4.489502e-11 21.62873
    ## 5 4.853301e-14 6.031500e-11 21.03436
    ## 6 5.121568e-14 6.031500e-11 20.98490

    Notice that compared to the quick start guide, I didn’t set the parameter number which is passed to topTable, so I can get out at most 10 significant features for each contrast/comparison (because the default value of number in topTable is 10). Similarly, if I didn’t set a value for p.value, topTable would return the top number genes per contrast, regardless of whether they were all significant or not. These are the defaults of topTable, which we purposefully do not modify, but we urge the user to read the documentation of topTable carefully to understand what is being asked for. In the QuickStart, we set number=NROW(ce) to make sure we got all significant genes.

    In addition to the columns provided by topTable, the column Contrast tells us what pairwise contrast the result is from. X1-X2 means a comparison of cluster 1 and cluster 2. The column IndexInOriginal gives the index of the gene to the original input data matrix, namely assay(ce). The other columns are given by topTable (with the column Feature renamed – it is usually ProbeID in limma).

    6.1.2 One Against All

    The choice type="OneAgainsAll" performs a comparison of a cluster against the mean of all of the other clusters.

    best1vsAll<-getBestFeatures(ce,contrastType="OneAgainstAll",p.value=0.05)
    head(best1vsAll)
    ##   IndexInOriginal ContrastName  Contrast Feature     logFC  AveExpr
    ## 1            5744            1 (X2)/1-X1   STMN2 -9.237481 8.478084
    ## 2            2209            1 (X2)/1-X1    GLI3  7.421220 5.573895
    ## 3            4284            1 (X2)/1-X1  PLXNA2 -7.476341 5.276724
    ## 4            5084            1 (X2)/1-X1    RTN1 -7.195767 7.624160
    ## 5             406            1 (X2)/1-X1   ATCAY -7.045734 4.951946
    ## 6            5275            1 (X2)/1-X1   SFRP1  7.375415 5.788717
    ##            t      P.Value    adj.P.Val        B
    ## 1 -13.060756 4.391031e-21 3.102703e-17 35.69468
    ## 2  10.743575 6.924754e-17 2.446515e-13 27.02199
    ## 3  -9.674490 7.126175e-15 1.678452e-11 22.79508
    ## 4  -9.384104 2.541467e-14 4.489502e-11 21.62873
    ## 5  -9.236713 4.853301e-14 6.031500e-11 21.03436
    ## 6   9.224464 5.121568e-14 6.031500e-11 20.98490
    >>>>>>> release/0.1.0

    Notice that now there is both a “Contrast” and a “ContrastName” column. Like before, “Contrast” gives an explicit definition of what is the comparisons, in the form of “(X2+X3+X4)/3-X1”, meaning the mean of the means of cluster 2,3,4 is compared to the mean of cluster1. “ContrastName” interprets this into a more usable name, namely that this contrast can be easily identified as a test of “cluster 1”.

    6.1.3 Dendrogram

    The option type="Dendro" is more complex; it assumes that there is a hiearchy of the clusters (created by makeDendrogram and stored in the ClusterExperiment object). Then for each node of the dendrogram, defines a contrast or comparison of the mean expression between the daughter nodes.

    To run this option, it is important that makeDendrogram corresponds to the dendrogram you want (that the relevant set of features were chosen, etc., see makeDendrogram above). For this reason, it is good to rerun the makeDendrogram command first.

    ce<-makeDendrogram(ce)
    <<<<<<< HEAD
    bestDendro<-getBestFeatures(ce,type="Dendro",p.value=0.05)
    ## Error in topTableF(fit2, genelist = rownames(fit$coef), ...): unused argument (type = "Dendro")
    head(bestDendro)
    ## Error in head(bestDendro): cannot open file '/Users/epurdom/Documents/Sequencing/SingleCell/clusterExperiment/vignettes/clusterExperimentTutorial_cache/getBestFeatures_dendro_5941f68d32d90550681e7d54500b9d0f.rdb': No such file or directory
    ======= bestDendro<-getBestFeatures(ce,contrastType="Dendro",p.value=0.05) head(bestDendro)
    ##   IndexInOriginal ContrastName Contrast Feature     logFC  AveExpr
    ## 1            5744        Node1    X1-X2   STMN2  9.237481 8.478084
    ## 2            2209        Node1    X1-X2    GLI3 -7.421220 5.573895
    ## 3            4284        Node1    X1-X2  PLXNA2  7.476341 5.276724
    ## 4            5084        Node1    X1-X2    RTN1  7.195767 7.624160
    ## 5             406        Node1    X1-X2   ATCAY  7.045734 4.951946
    ## 6            5275        Node1    X1-X2   SFRP1 -7.375415 5.788717
    ##            t      P.Value    adj.P.Val        B
    ## 1  13.060756 4.391031e-21 3.102703e-17 35.69468
    ## 2 -10.743575 6.924754e-17 2.446515e-13 27.02199
    ## 3   9.674490 7.126175e-15 1.678452e-11 22.79508
    ## 4   9.384104 2.541467e-14 4.489502e-11 21.62873
    ## 5   9.236713 4.853301e-14 6.031500e-11 21.03436
    ## 6  -9.224464 5.121568e-14 6.031500e-11 20.98490
    >>>>>>> release/0.1.0

    Again, there is both a “ContrastName” and “Contrast” column. The “Contrast” column identifies which clusters where on each side of the node (and hence commpared) and “ContrastName” is a name for the node.

    levels((bestDendro)$Contrast)
    ## Warning in levels((bestDendro)$Contrast): restarting interrupted promise
    ## evaluation
    ## Warning in levels((bestDendro)$Contrast): internal error -3 in
    ## R_decompress1
    ## Error in levels((bestDendro)$Contrast): lazy-load database '/Users/epurdom/Documents/Sequencing/SingleCell/clusterExperiment/vignettes/clusterExperimentTutorial_cache/getBestFeatures_dendro_5941f68d32d90550681e7d54500b9d0f.rdb' is corrupt

    6.2 Analysis for count and other RNASeq data

    The getBestFeatures method for ClusterExperiment objects has an argument isCount. If this is marked TRUE then the data in assay(x) is assumed to be counts, and the call to limma uses the voom correction. This correction deals with the mean-variance relationship that is found with count data. This means that the differential expression analysis is done on \(log_2(x+0.5)\). This is regardless of what transformation is stored in the ClusterExperiment object! The voom call within getBestFeatures however, is by default set to normalize.method = "none" in the call to voom (though the user can set normalize.method in the call to getBestFeatures).

    If instead isCount=FALSE, then limma is performed on transform(x), i.e. the data after transformation of the data with the transformation stored in the ClusterExperiment object. In this case, there is no voom correction.

    Unlike edgeR or DESeq, the voom correction does not explicitly require a count matrix, and therefore it has been proposed that it can be used on FPKM or TPM entries, or data normalized via RUV. Setting isCount=TRUE even if the data in the assay slot is not count will have this effect. However, the authors of the package do not recommend using voom on anything other than counts, see e.g. this discussion.

    6.3 Additional considerations

    The user should be careful about questions of multiple comarisons when all of these multiple contrasts are being performed on each feature; the default is to correct across all of these tests (see the help of getBestFeatures and the argument contrastAdj for more). As noted in the introduction, p-values created in this way are reusing the data (since the data was also used for creating the clusters) and hence should not be considered valid p-values regardless.

    <<<<<<< HEAD

    As mentioned, getBestFeatures accepts arguments to limma’s function topTable to decide which genes should be returned (and in what order). In particular, we can set an adjusted p-value cutoff for each contrast, and set number to control the number of genes returned for each contrast. By setting number to be the length of all genes, and p.value=0.05, we can return all genes for each contrast that have adjusted p-values less than 0.05. All of the arguments to topTable regarding what results are returned and in what order can be given by the user at the call to getBestFeatures.

    =======

    As mentioned, getBestFeatures accepts arguments to limma’s function topTable to decide which genes should be returned (and in what order). In particular, we can set an adjusted p-value cutoff for each contrast, and set number to control the number of genes returned for each contrast. By setting number to be the length of all genes, and p.value=0.05, we can return all genes for each contrast that have adjusted p-values less than 0.05. All of the arguments to topTable regarding what results are returned and in what order can be given by the user at the call to getBestFeatures.

    >>>>>>> release/0.1.0

    Tseng, George C., and Wing H. Wong. 2005. “Tight Clustering: A Resampling-Based Approach for Identifying Stable and Tight Patterns in Data.” Biometrics 61 (1): 10–16.